Abstract: It is shown that a consistent application of Bayesian
updating from a prior probability density to a posterior using
evidence in the form of expectation constraints leads to exactly
the same results as the application of the maximum entropy
principle, namely a posterior belonging to the exponential family.
The Bayesian updating procedure presented in this work is
not expressed as a variational principle, and does not involve
the concept of entropy. Therefore it conceptually constitutes a
complete alternative to entropic methods of inference.

I. Introduction 

The principle of maximum entropy (MaxEnt for short) is
considered a fundamental tool in the application of information
theory to the construction of probabilistic models [?]. First
proposed in the context of Statistical Mechanics by J. W.
Gibbs [?] as a derivation of the canonical ensemble from the
requirement of fixed average internal energy, it was recast by
Jaynes [?] as a general principle of reasoning following the
insights of Shannon [?].

This principle of reasoning under uncertainty, which we call
MaxEnt, postulates that the most unbiased probability assignment
$p^*(u)$ among those probability densities $p(u)$ consistent
with given information $E$ is the one that maximizes the
universal entropy functional

$$S[p; p_0] = -\int_U d^n u \, p(u) \ln \frac{p(u)}{p_0(u)}, \tag{1}$$

under the constraints imposed by $E$. Here $U$ is an $n$-dimensional
state space and $u \in U$ a possible state of the
system. The density $p_0(u)$ is the initial probability density
representing complete ignorance about the value of $u$.

The maximization problem is solved using the method
of Lagrange multipliers, and for the case of information $E$
expressed as the given expectation $\langle f(u) \rangle = F$ the solution
is well known [?], [?], and corresponds to the standard exponential
family of models,

$$p^*(u) = \frac{1}{Z(\lambda)} p_0(u) \exp(-\lambda f(u)), \tag{2}$$

where $Z(\lambda)$ is the normalization constant and $\lambda$ is the
Lagrange multiplier, which can be determined from the implicit equation

$$-\frac{\partial}{\partial \lambda} \ln Z(\lambda) = F. \tag{3}$$
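
As a minimal numerical sketch of Eqs. (2) and (3), assuming a discrete state space in place of the continuous $U$, the multiplier can be found by bisection, because the posterior expectation of $f$ decreases monotonically with $\lambda$ (its derivative is minus the variance of $f$). The die example, the function names and the bracket $[-50, 50]$ are illustrative assumptions, not part of the original derivation.

```python
import math

def log_partition(lam, states, prior, f):
    """ln Z(lam) for a discrete prior: Z = sum_u p0(u) exp(-lam f(u))."""
    return math.log(sum(p * math.exp(-lam * f(u)) for u, p in zip(states, prior)))

def posterior(lam, states, prior, f):
    """Discrete analogue of the exponential-family posterior of Eq. (2)."""
    weights = [p * math.exp(-lam * f(u)) for u, p in zip(states, prior)]
    z = sum(weights)
    return [w / z for w in weights]

def expectation(lam, states, prior, f):
    """<f> under the posterior; this equals -d ln Z / d lam."""
    post = posterior(lam, states, prior, f)
    return sum(q * f(u) for u, q in zip(states, post))

def solve_lambda(target, states, prior, f, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on Eq. (3), using that <f> is decreasing in lam."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expectation(mid, states, prior, f) > target:
            lo = mid  # expectation still too large: lam must increase
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical example: six-sided die, uniform prior, constraint <f> = 4.5
states = [1, 2, 3, 4, 5, 6]
prior = [1.0 / 6.0] * 6
f = lambda u: float(u)
lam = solve_lambda(4.5, states, prior, f)
```

Since the target mean 4.5 lies above the prior mean 3.5, the resulting `lam` is negative, which tilts the posterior towards the larger faces.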

The uniqueness of the MaxEnt procedure relies on the
uniqueness of the entropy functional $S$ of Eq. (1), and this has
been established from consistency requirements [6], [?], [?],
[?]. These axiomatic approaches have allowed a useful separation
between the operational formalism of MaxEnt updating
(in the sense of formulating an optimal inference rule) and the
meaning of the entropy $S$ being maximized.

In the purely operational vision, entropy is just a convenient 
device which allows the ranking of the different candidate 
distributions from best to worst, and we do not need to assign 
any meaning to it. However, this ranking must, at some point,
be understood in terms of some quality of the candidate models
by which they are accepted or rejected. According to
the standard interpretation of the entropy $S$ by Shannon (later
adopted by Jaynes), the quantity $S$ is a measure of the missing
information needed to completely determine the state $u$. In
maximizing $S$ we are choosing the least informative model
that agrees with all our constraints.

The original entropy axioms by Shannon [?] as well as
the Shore and Johnson axioms [?] have recently been called into
question [?] because they rule out the so-called generalized
entropies (such as the Tsallis or Rényi entropies) for use
in inference [?]. In this ongoing development it would be
highly desirable to have alternative methods for updating probabilities,
and an obvious candidate is Bayes' theorem itself.
Both MaxEnt and Bayes' theorem solve essentially the same
problem, namely updating a prior to a posterior incorporating
some new information or evidence, regardless of the nature of
that evidence (measured data or model constraints). Looking
closely, we can in fact recognize the exponential form given
in Eq. (2) as a Bayesian updating from a prior distribution
$p_0(u)$ to a posterior distribution $p^*(u) = M(u) p_0(u)$ with
an "updating factor"

$$M(u) \propto \exp\Big(-\sum_i \lambda_i f_i(u)\Big). \tag{4}$$

If we can arrive at this form of $M$ without invoking a maximization
of some functional, then we can effectively bypass
the issue of adequacy of $S$. In a frequentist context (taking
expectations as the limit of statistical averages) Campenhout
and Cover [?] have produced this factor $M$ in the limit of
infinite samples.

In this work we provide a general derivation of the update
rule in Eq. (4) from Bayesian inference subjected to expectation
constraints. Unlike the result by Campenhout and Cover, the
proof does not depend on frequentist assumptions such as the
identification of expectations with averages over samples, or
even the assumption of data samples being processed. It is
based simply on imposing consistency conditions (in similar
spirit to Cox [?] and Shore and Johnson) on the Bayesian
updating rules themselves to constrain the form of $M$.

The work is organized as follows. First, in Section II we
present the notation to be used in the rest of the paper and give
an outline of the assumptions needed for the central claim of
the paper. In Sections III, IV and V the exponential form is
obtained from the assumptions.

II. Mathematical notation and statement of the derivation

We will consider in the following a system described by
states $u$ in an $n$-dimensional state space $U$. We start from
an arbitrary state of knowledge $I_0$ with probability density
$P(u|I_0)$ and our goal is to perform a Bayesian update to a
posterior density $P(u|I)$ where $I = I_0 \wedge E$ is the new state of
knowledge that includes the evidence $E$. We will denote the
expectation of a quantity $A(u)$ under the state of knowledge
$J$ by $\langle A \rangle_J$, defined as*

$$\langle A \rangle_J = \int_U d^n u \, P(u|J) A(u). \tag{5}$$

For the problem of inference with prior $P(u|I_0)$ and evidence
$E$ given by $\langle f(u) \rangle_I = F$, we define the following
consistency conditions:

(a) The probability ratio $P(E|u, I_0)/P(E|I_0)$ is a unique
functional of the constraining function $f$, evaluated
using the state $u$ and the constraining value $F$. That
is, we can write

$$\frac{P(E|u, I_0)}{P(E|I_0)} = M[f](u, F). \tag{6}$$

The functional $M$ encodes the method of inference, and
we are looking for a unique method. The only information
we have in using this method is the evidence $E$,
which consists of the function $f$ and its expectation $F$,
so $M$ cannot depend on any other piece of information,
such as additional parameters.

(b) The inference considering $f$ as a function of $u$ is consistent
with the inference considering $f$ as a fundamental
variable. In other words, it should be possible to ignore
the degrees of freedom $u$ and perform an inference over
$f$ itself, using the same functional $M$. This inference has
to be valid and consistent with the full inference using
$u$.

(c) Logically independent subsystems $U_1$ and $U_2$ can be
analyzed separately or jointly as $U = U_1 \otimes U_2$, producing
the same result. This does not restrict the method to
separable systems only; it only ensures that if we decide
to apply it to a pair of independent systems, the method
should preserve their independence. The same functional
$M$ must be valid for systems of arbitrary correlation.

In the following we will show that these conditions imply
a posterior

$$P(u|E, I_0) = \frac{1}{Z(\lambda)} \exp(-\lambda f(u)) P(u|I_0). \tag{7}$$

Here $\lambda$ is a parameter to be fixed by the value of $F$, not
a Lagrange multiplier, as there is no variational procedure in
the derivation.

The proof is divided in two parts. First, we prove that
conditions (a) and (b) imply that the functional $M[f](u, F)$
is actually a function of two arguments, $m(f(u), F)$. Finally,
we show that condition (c) implies $m(f, F) \propto \exp(-\lambda(F) f)$,
which immediately leads to Eq. (7) after normalization.

* Note that, if $A$ is a scalar (i.e. invariant under a change of coordinate
system) then $\langle A \rangle_J$ is also a scalar, because $P(u|J)$ is a scalar density, which
transforms just as the invariant measure $\sqrt{g(u)}$. Therefore we do not include
$\sqrt{g}$ explicitly in the integral.

III. Consistency between different representations

The evidence $E$, consisting of the given expectation $\langle f \rangle_I = F$,
can be used in two different ways. If we regard the quantity
$f$ as a variable in itself, it can be used to update the prior
$P(f|I_0)$ to a posterior $P(f|I)$, given by

$$P(f|I) = P(f|I_0) M[\mathbb{I}](f, F). \tag{8}$$

Here the functional depends on the identity function $f \to f$,
denoted by $\mathbb{I}$. On the other hand, we can take $f$ as a function
of the degrees of freedom $u$, and use the evidence $E$ to update
the prior $P(u|I_0)$ to a posterior $P(u|I)$,

$$P(u|I) = P(u|I_0) M[f](u, F). \tag{9}$$

For both problems the same functional $M$ should be used
and yield consistent posteriors. Through the laws of probability,
the probability density of $f$ is always connected to the
probability density of $u$ by

$$P(f|J) = \langle \delta(f(u) - f) \rangle_J, \tag{10}$$

for every state of knowledge $J$, which in our case produces
two independent relations, for $J = I$ and $J = I_0$, namely

$$P(f|I) = \int d^n u \, P(u|I) \, \delta(f(u) - f), \tag{11}$$

$$P(f|I_0) = \int d^n u \, P(u|I_0) \, \delta(f(u) - f). \tag{12}$$

By replacing Eqs. (8) and (9) we find a constraint for $M$,
namely

$$\int d^n u \, P(u|I_0) \, \delta(f(u) - f) \Big[ M[f](u, F) - M[\mathbb{I}](f, F) \Big] = 0 \tag{13}$$

for any prior $P(u|I_0)$. The delta function allows us to replace
$f$ by $f(u)$ in the arguments to $M[\mathbb{I}]$, so we have

$$\int d^n u \, P(u|I_0) \, \delta(f(u) - f) \Big[ M[f](u, F) - M[\mathbb{I}](f(u), F) \Big] = 0. \tag{14}$$
Then, taking the functional derivative on both sides with
respect to the prior, we have

$$M[f](u, F) = M[\mathbb{I}](f(u), F) = m(f(u), F). \tag{15}$$

This means that for any constraining function $f$ the functional
$M$ reduces to a function of $f(u)$ (the constraining
function evaluated at the state $u$) and $F$ (the constraining
value); it does not, for instance, depend on derivatives of $f$.
We have then constrained the form of the posterior distribution
to be

$$P(u|I) = P(u|I_0) \, m(f(u), F) \tag{16}$$

with $m$ a unique function, to be determined.
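
The consistency requirement of this section can be checked numerically in a discrete analogue, where the delta function in Eqs. (11) and (12) becomes a sum over all states sharing the same value of $f$. The sketch below assumes a hypothetical 4 x 4 grid of states, a made-up non-uniform prior, and a trial factor $m(f) = \exp(-0.7 f)$ chosen only for illustration; it compares updating over $u$ and then marginalising onto $f$ against marginalising first and updating over $f$ with the same $m$.

```python
import math
from collections import defaultdict

def pushforward(states, probs, f):
    """Discrete analogue of Eqs. (11)-(12): P(f|J) = sum_u P(u|J) [f(u) == f]."""
    out = defaultdict(float)
    for u, p in zip(states, probs):
        out[f(u)] += p
    return dict(out)

def update(states, prior, factor):
    """Multiply a prior by an updating factor and renormalise."""
    weights = [p * factor(s) for s, p in zip(states, prior)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical setup: states u = (x, y), constraining function f(u) = x + y
states = [(x, y) for x in range(4) for y in range(4)]
raw = [1.0 + x + 2.0 * y for x, y in states]
prior_u = [r / sum(raw) for r in raw]
f = lambda u: u[0] + u[1]
lam = 0.7                                  # illustrative value only
m = lambda fval: math.exp(-lam * fval)     # trial factor m(f), depends on f alone

# Route A: update over u with m(f(u)), then marginalise onto f
post_u = update(states, prior_u, lambda u: m(f(u)))
route_a = pushforward(states, post_u, f)

# Route B: marginalise the prior onto f, then update over f with the same m
prior_f = pushforward(states, prior_u, f)
f_values = sorted(prior_f)
post_f = update(f_values, [prior_f[v] for v in f_values], m)
route_b = dict(zip(f_values, post_f))

max_gap = max(abs(route_a[v] - route_b[v]) for v in f_values)
```

Both routes agree to rounding error (`max_gap` is of the order of machine precision), as expected for any factor that depends on $u$ only through $f(u)$.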

IV. A theorem for posterior expectations

In order to find this universal function $m(f, F)$ we first
present a useful identity between expectations. Applying
Stokes' theorem to an arbitrary probability density $P(u|J)$ we
obtain (see the Appendix) the following expectation identity,

$$\langle \nabla \cdot v + v \cdot \nabla \ln P(u|J) \rangle_J = 0, \tag{17}$$

valid for any differentiable vector function $v(u)$. This is a
generalization of the result in Ref. [?].
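
As a quick one-dimensional sanity check of the identity in Eq. (17), which there reads $\langle v'(u) + v(u) \, \partial_u \ln P(u|J) \rangle_J = 0$, one can take a standard Gaussian density (which vanishes at infinity) and an arbitrary smooth test function. The density, the test function $v(u) = u^3 + \sin u$, the integration range and the trapezoidal rule are all illustrative assumptions.

```python
import math

def expectation(g, density, lo=-12.0, hi=12.0, n=20001):
    """Trapezoidal estimate of <g> = integral of density(u) * g(u) du."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        u = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0  # half weight at the end points
        total += w * density(u) * g(u)
    return total * h

# Standard Gaussian density and its log-derivative d ln P / du = -u
density = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
dlnp = lambda u: -u

# Arbitrary differentiable test function and its derivative
v = lambda u: u ** 3 + math.sin(u)
dv = lambda u: 3.0 * u ** 2 + math.cos(u)

# Left-hand side of the identity: <dv/du + v * d ln P / du>
lhs = expectation(lambda u: dv(u) + v(u) * dlnp(u), density)
norm = expectation(lambda u: 1.0, density)
```

Here `lhs` vanishes to numerical precision while `norm` is 1, consistent with the identity; other test functions can be tried as long as $v(u) P(u|J)$ vanishes at the boundary.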

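Replacing the form of the posterior found in Eq. (16) we
have, for the state of knowledge $I$,

$$\langle \nabla \cdot v \rangle_I + \langle v \cdot \nabla \ln P(u|I_0) \rangle_I = -\langle v \cdot \nabla \ln m(f(u), F) \rangle_I, \tag{18}$$

or, more compactly,

$$\langle \nabla_0 v \rangle_I = \langle \lambda(f(u), F) \, v \cdot \nabla f \rangle_I, \tag{19}$$

where the operator $\nabla_0$ is defined as $\nabla_0 v = \nabla \cdot v + v \cdot \nabla \ln P(u|I_0)$ and

$$\lambda(f, F) = -\frac{\partial}{\partial f} \ln m(f, F). \tag{20}$$

In the next section we will constrain the form of this
function $\lambda$ using the condition (c) on independent systems.
We will show that $\lambda$ only depends on its second argument $F$.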

V. Consistency between separate and joint treatment of independent subsystems

Now let us consider the following situation: two logically
independent systems, $U_1$ and $U_2$, with priors $P(u_1|I_0)$ and
$P(u_2|I_0)$ respectively. We can decide to update these priors
separately using the evidence $E$ given by

$$\langle f_1(u_1) \rangle_I = F_1, \tag{21}$$

$$\langle f_2(u_2) \rangle_I = F_2. \tag{22}$$

We could also decide to update the joint system prior
$P(u|I_0) = P(u_1|I_0) P(u_2|I_0)$, using the evidence

$$\langle f_1(u_1) + f_2(u_2) \rangle_I = F_1 + F_2. \tag{23}$$

In both cases we are forced, by condition (a), to use Eq. (16)
with the same function $m$ in each case. From Eqs. (21) and (22)
applied separately on $U_1$ and $U_2$, we obtain

$$\langle \nabla_0 v_1(u_1) \rangle_I = \langle \lambda(f_1(u_1), F_1) \, v_1 \cdot \nabla f_1 \rangle_I, \tag{24}$$

$$\langle \nabla_0 v_2(u_2) \rangle_I = \langle \lambda(f_2(u_2), F_2) \, v_2 \cdot \nabla f_2 \rangle_I, \tag{25}$$

and from Eq. (23)

$$\langle \nabla_0 v(u) \rangle_I = \langle \lambda(f_1(u_1) + f_2(u_2), F_1 + F_2) \, v \cdot \nabla f \rangle_I. \tag{26}$$

Choosing first $v(u) = v_1(u_1)$ and then $v(u) = v_2(u_2)$ we find
(calling $f_{12} = f_1 + f_2$),

$$\langle \lambda(f_{12}, F_1 + F_2) \, v_1 \cdot \nabla f_1 \rangle_I = \langle \lambda(f_1, F_1) \, v_1 \cdot \nabla f_1 \rangle_I, \tag{27}$$

$$\langle \lambda(f_{12}, F_1 + F_2) \, v_2 \cdot \nabla f_2 \rangle_I = \langle \lambda(f_2, F_2) \, v_2 \cdot \nabla f_2 \rangle_I, \tag{28}$$

for every choice of $v_1$ and $v_2$. This implies

$$\lambda(f_{12}, F_1 + F_2) = \lambda(f_1, F_1) = \lambda(f_2, F_2). \tag{29}$$

As $f_1$ only depends on $u_1$, and $f_2$ only depends on $u_2$, the
last equality means that, in general, the function $\lambda(f, F)$ does
not depend on $f$; it is only a function of the second argument
$F$, so that we can replace $\lambda(f(u), F)$ by $\lambda(F)$ such that

$$\lambda(F_1 + F_2) = \lambda(F_1) = \lambda(F_2). \tag{30}$$

Replacing $\lambda(f(u), F)$ by $\lambda(F)$ in the definition of $\lambda$
(Eq. (20)), we have

$$-\frac{\partial}{\partial f} \ln m(f, F) = \lambda(F), \tag{31}$$

and therefore,

$$m(f, F) = m_0(F) \exp(-\lambda(F) f). \tag{32}$$

The posterior distribution for $u$ in Eq. (16) then reads

$$P(u|I) = m_0(F) \exp(-\lambda(F) f(u)) P(u|I_0), \tag{33}$$

with $m_0(F)$ fixed by normalization to be

$$\frac{1}{m_0(F)} = \int d^n u \, \exp(-\lambda(F) f(u)) P(u|I_0) = Z(\lambda(F)). \tag{34}$$

As all the dependence on $F$ is through $\lambda(F)$, we can finally
write the distribution entirely as a function of $\lambda$, so

$$P(u|I) = \frac{1}{Z(\lambda)} \exp(-\lambda f(u)) P(u|I_0), \tag{35}$$

with $\lambda = \lambda(F)$ a number to be determined to agree with the
constraint $\langle f(u) \rangle_I = F$. Given this functional form of $P(u|I)$
we can write the expectation of $f$ as a derivative of $\ln Z(\lambda)$,

$$\langle f(u) \rangle_I = -\frac{\partial}{\partial \lambda} \ln Z(\lambda), \tag{36}$$

and then it follows that the constraint fixes $\lambda$ through Eq. (3),
as expected.
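
The compatibility of the exponential form with condition (c) can be illustrated on a small discrete example: if two independent subsystems are each updated with the factor $\exp(-\lambda f_i)$ for a common $\lambda$, the joint update with $f_1 + f_2$ and the same $\lambda$ reproduces the product of the separate posteriors, and its expectation equals $F_1 + F_2$. The priors, the values of $f_1$ and $f_2$, and $\lambda = 0.4$ below are made-up numbers; the sketch only checks consistency and does not reproduce the uniqueness argument.

```python
import math

def tilt(prior, f_values, lam):
    """Exponential update: posterior proportional to prior * exp(-lam * f)."""
    weights = [p * math.exp(-lam * fv) for p, fv in zip(prior, f_values)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(probs, values):
    """Expectation of `values` under the discrete distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))

lam = 0.4  # illustrative common parameter

# Two hypothetical independent subsystems with their own priors and functions
prior1, f1 = [0.2, 0.5, 0.3], [0.0, 1.0, 2.0]
prior2, f2 = [0.1, 0.4, 0.3, 0.2], [0.0, 0.5, 1.5, 3.0]

# Separate updates and the expectations F1, F2 they imply
post1 = tilt(prior1, f1, lam)
post2 = tilt(prior2, f2, lam)
F1, F2 = mean(post1, f1), mean(post2, f2)

# Joint update on the product space with f = f1 + f2 and the same lam
joint_prior = [p1 * p2 for p1 in prior1 for p2 in prior2]
joint_f = [a + b for a in f1 for b in f2]
joint_post = tilt(joint_prior, joint_f, lam)

# Product of the separately updated posteriors, in the same ordering
product_post = [q1 * q2 for q1 in post1 for q2 in post2]

F12 = mean(joint_post, joint_f)
max_gap = max(abs(a - b) for a, b in zip(joint_post, product_post))
```

The joint posterior coincides with the product of the separate posteriors and `F12` equals `F1 + F2`, so the independence of the subsystems is preserved by the update.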
VI. Conclusion

We have proved that Bayesian updating given an expectation 
constraint can be established as a uniquely defined procedure, 
leading to an exponential family posterior density, which is 
the same result produced by the application of the principle 
of maximum entropy (MaxEnt) under the same constraints. 
We only require that a unique “updating factor” is used in all 
cases, and that its use is consistent through different definitions 
of the state space. That the same answer is revealed using 
an alternative method of inference shows that the essence of 
MaxEnt is already contained in Bayes’ theorem under our 
additional consistency requirements. This allows a conceptual 
unification of the fields of Bayesian inference and MaxEnt 
inference under a common framework for continuous degrees 
of freedom. Our derivation also shows that the existence of 
an entropy functional is not central to the core of inference; 
although it certainly provides a more than convenient device 
for practical calculations, it is conceptually not required for a 
consistent formulation of a theory of inference. 


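Appendix

An identity for expectations of continuous variables

We will consider an $n$-dimensional manifold $U$ with metric
tensor $g_{\mu\nu}(u)$ and induced metric $h$ on the surface $\partial U$
which acts as a boundary of $U$. Recall the covariant form of
Stokes' theorem [?],

$$\int_U d^n u \, \sqrt{g} \, \nabla_\mu \omega^\mu = \int_{\partial U} d^{n-1} u \, \sqrt{h} \, n_\mu \omega^\mu, \tag{37}$$

with $\omega$ any differentiable field, $n$ the normal to $\partial U$ and
$\nabla_\mu \omega^\mu$ the covariant divergence, defined as

$$\nabla_\mu \omega^\mu = \partial_\mu \omega^\mu + \omega^\mu \partial_\mu \ln \sqrt{g}. \tag{38}$$

Now consider the field

$$\omega^\mu = \frac{1}{\sqrt{g}} v^\mu(u) P(u|J), \tag{39}$$

where the probability density $P(u|J)$ vanishes on the boundary
$\partial U$. Replacing in Eq. (37) we have

$$\int_U d^n u \, P(u|J) \Big( \partial_\mu v^\mu + v^\mu \partial_\mu \ln P(u|J) \Big) = \int_{\partial U} d^{n-1} u \, \sqrt{\frac{h}{g}} \, n_\mu v^\mu P(u|J) = 0. \tag{40}$$

Defining the expectation $\langle A \rangle_J$ as

$$\langle A \rangle_J = \int_U d^n u \, P(u|J) A(u), \tag{41}$$

we arrive at the identity

$$\langle \partial_\mu v^\mu \rangle_J + \langle v^\mu \partial_\mu \ln P(u|J) \rangle_J = 0. \tag{42}$$

This can be written in a manifestly covariant manner as

$$\langle \nabla \cdot v + v \cdot \nabla \ln P(u|J) \rangle_J = 0. \tag{43}$$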
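
The step from Stokes' theorem to Eq. (40) can be spelled out as follows. This is a sketch under two assumptions about the appendix equations: that Eq. (38) is the standard covariant divergence $\nabla_\mu \omega^\mu = \partial_\mu \omega^\mu + \omega^\mu \partial_\mu \ln \sqrt{g}$, and that the field of Eq. (39) is $\omega^\mu = v^\mu P / \sqrt{g}$ with $P = P(u|J)$. Under those assumptions the factors of $\sqrt{g}$ cancel in the volume integrand:

```latex
\begin{align*}
\sqrt{g}\,\nabla_\mu \omega^\mu
  &= \sqrt{g}\left[\partial_\mu\!\left(\frac{v^\mu P}{\sqrt{g}}\right)
     + \frac{v^\mu P}{\sqrt{g}}\,\partial_\mu \ln\sqrt{g}\right] \\
  &= \partial_\mu\!\left(v^\mu P\right)
     - v^\mu P\,\partial_\mu \ln\sqrt{g}
     + v^\mu P\,\partial_\mu \ln\sqrt{g} \\
  &= P\left(\partial_\mu v^\mu + v^\mu\,\partial_\mu \ln P\right).
\end{align*}
```

Integrating this expression over $U$ gives the left-hand side of Eq. (40), and the surface term vanishes because $P(u|J)$ is zero on $\partial U$.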





